Applications in Computer Vision
Algorithm 13 Training 1-bit detectors via LWS-Det.
Input: The training dataset, pre-trained teacher model.
Output: 1-bit detector.
1: Initialize $\alpha_i$ and $\beta_i^{o_k} \sim \mathcal{N}(0, 1)$ and other real-valued parameters layer-wise;
2: for $i = 1$ to $N$ do
3:   while Differentiable search do
4:     Compute $\mathcal{L}_i^{Ang}$, $\mathcal{L}_i^{Amp}$, $\mathcal{L}_i^{W}$
5:   end while
6: end for
7: Compute $\mathcal{L}_{GT}$, $\mathcal{L}_{Lim}$
8: for $i = N$ to $1$ do
9:   Update parameters via back propagation
10: end for
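The control flow of Algorithm 13 can be sketched as follows. This is only a schematic, assuming hypothetical shapes and a fixed step count in place of the search condition; the loss computations are stubbed out rather than implementing the actual LWS-Det objectives.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 4             # number of 1-bit layers (illustrative)
SEARCH_STEPS = 3  # stand-in for the "while Differentiable search" condition

# Step 1: initialize scale factors alpha_i and architecture parameters
# beta_i ~ N(0, 1), layer-wise; one beta row per candidate weight in O = {w-, w+}.
alpha = [np.ones(8) for _ in range(N)]
beta = [rng.normal(size=(2, 8)) for _ in range(N)]

def layer_losses(i):
    """Placeholder for computing L_i^Ang, L_i^Amp, L_i^W for layer i."""
    return 0.0, 0.0, 0.0

# Steps 2-6: layer-wise differentiable search.
for i in range(N):
    for _ in range(SEARCH_STEPS):
        l_ang, l_amp, l_w = layer_losses(i)

# Step 7: detection and fine-grained feature losses on the whole model (stubbed).
l_gt, l_lim = 0.0, 0.0

# Steps 8-10: update parameters from the last layer back to the first.
for i in reversed(range(N)):
    pass  # a back-propagation update for layer i would go here
```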
We introduce the DARTS framework to solve Eq. 6.72, which we name differentiable binarization search (DBS). Following [151], we efficiently search for $w_i$. Specifically, we approximate $w_i$ by the weighted probability of two matrices whose weights are set to all $-1$ and all $+1$, respectively. We relax the choice of a particular weight by the probability function defined as
$$p_i^{o_k} = \frac{\exp(\beta_i^{o_k})}{\sum_{o'_k \in O} \exp(\beta_i^{o'_k})}, \quad \text{s.t. } O = \{w_i^-, w_i^+\}, \tag{6.73}$$
where $p_i^{o_k}$ is the probability matrix belonging to the operation $o_k \in O$. The search space $O$ is defined as the two possible weights $\{w_i^-, w_i^+\}$. At the inference stage, we select the weight with the maximum probability as
$$\hat{w}_{i,l} = \arg\max_{o_k} p_{i,l}^{o_k}, \tag{6.74}$$
where $p_{i,l}^{o_k}$ denotes the probability that the $l$-th weight of the $i$-th layer belongs to operation $o_k$. Therefore, the $l$-th weight of $\hat{w}$, that is, $\hat{w}_{i,l}$, is defined by the operation having the highest probability. In this way, we modify Eq. 6.87 by substituting $\hat{w}_i$ for $w_i$ as
$$\mathcal{L}_i^{Ang} = \left\| \frac{a_{i-1} \otimes w_i}{\|a_{i-1}\|_2 \|w_i\|_2} - \frac{a_{i-1} \odot \hat{w}_i}{\|a_{i-1}\|_2 \|\hat{w}_i\|_2} \right\|_2^2. \tag{6.75}$$
By doing so, we retain the top-1 (strongest) operation among the distinct weights for each element of $w_i$, drawn from the discrete set $\{+1, -1\}$.
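The relaxation of Eq. 6.73, the discretization of Eq. 6.74, and the angular loss of Eq. 6.75 can be sketched in numpy as below. This is an illustrative sketch, not the chapter's implementation: the function names are hypothetical, and the convolution $\otimes$ is simplified to a dot product.

```python
import numpy as np

def dbs_probabilities(beta_minus, beta_plus):
    """Softmax relaxation over the two candidate operations O = {w-, w+} (Eq. 6.73)."""
    betas = np.stack([beta_minus, beta_plus])          # shape (2, n_weights)
    betas = betas - betas.max(axis=0, keepdims=True)   # numerical stability
    e = np.exp(betas)
    return e / e.sum(axis=0, keepdims=True)            # p_i^{o_k}, columns sum to 1

def discretize(beta_minus, beta_plus):
    """Select the weight with maximum probability (Eq. 6.74): -1 or +1 per element."""
    p = dbs_probabilities(beta_minus, beta_plus)
    # row index 0 corresponds to w- = -1, row index 1 to w+ = +1
    return np.where(p.argmax(axis=0) == 1, 1.0, -1.0)

def angular_loss(a_prev, w_real, w_bin):
    """Squared difference of the two normalized responses (Eq. 6.75),
    with convolution simplified to a dot product for illustration."""
    t = (a_prev @ w_real) / (np.linalg.norm(a_prev) * np.linalg.norm(w_real))
    s = (a_prev @ w_bin) / (np.linalg.norm(a_prev) * np.linalg.norm(w_bin))
    return np.sum((t - s) ** 2)
```

Note that the loss vanishes when the binarized branch reproduces the direction of the real-valued response, which is exactly the angular alignment the search optimizes.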
6.4.4 Learning the Scale Factor
After searching for $w_i$, we learn the real-valued layers between the $i$-th and $(i+1)$-th 1-bit convolutions. We omit the batch normalization (BN) and activation layers for simplicity. We can directly simplify Eq. 6.69 as
$$\mathcal{L}_i^{Amp} = E_i(\alpha_i; w_i, \hat{w}_i, a_{i-1}, a_{i-1}). \tag{6.76}$$
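Eq. 6.76 casts learning the scale factor as minimizing a reconstruction error between the real-valued and 1-bit responses. A minimal sketch under simplifying assumptions: a single scalar $\alpha$, convolution replaced by a dot product, and a closed-form least-squares fit; `fit_scale` is a hypothetical helper, not from the chapter.

```python
import numpy as np

def fit_scale(a_prev, w_real, w_bin):
    """Closed-form least-squares alpha minimizing
    || a_prev @ w_real - alpha * (a_prev @ w_bin) ||_2^2."""
    t = a_prev @ w_real  # real-valued (teacher) response
    s = a_prev @ w_bin   # 1-bit (student) response
    return float(np.dot(t, s) / np.dot(s, s))
```

In practice the scale factor is learned jointly with the other real-valued parameters by gradient descent, but the closed-form fit above shows what the amplitude loss is pulling $\alpha_i$ toward.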
Following conventional BNNs [77, 287], we employ Eq. 6.80 to further supervise the scale factor $\alpha_i$. According to [235], we impose a fine-grained limitation on the features to aid detection. Hence, the supervision of LWS-Det is formulated as
$$\mathcal{L} = \mathcal{L}_{GT} + \lambda \mathcal{L}_{Lim} + \mu \sum_{i=1}^{N} \left( \mathcal{L}_i^{Ang} + \mathcal{L}_i^{Amp} \right) + \gamma \sum_{i=1}^{N} \mathcal{L}_i^{W}, \tag{6.77}$$